(54) Speech recognition confidence level display

(57) A speech recognition system and method indicate the level of
confidence that a speech recognizer has in its recognition of one or
more displayed words. The system and method allow for the rapid
identification of speech recognition errors. A plurality of confidence
levels of individual recognized words may be visually indicated.
Additionally, the system and method allow the user of the system to
select threshold levels to determine when the visual indication occurs.



[FIG. 1: block diagram. Labelled elements: GUI display; GUI application
(150); speech engine (160); microphone (170); word 1, word 2 and word 3
with attribute 1, attribute 2 and attribute 3 (110, 120, 130); user
control (140); confidence level indicator process; recognizer;
confidence level scorer (200); word/score pairs (210).]

Printed by Jouve, 75001 PARIS (FR)



EP 0 924 687 A2



Description 

BACKGROUND OF THE INVENTION 
Field of the Invention

[0001] This invention relates to the field of speech recognition
systems. More specifically, this invention relates to user interfaces
for speech recognition systems, and yet more specifically to a method
and apparatus for assisting a user in reviewing transcription results
from a speech recognition dictation system.

Description of the Related Art 

[0002] Text processing systems, e.g. word processors with spell
checkers, such as Lotus WordPro™ and Word Perfect™ by Novell, can
display misspelled words (i.e. words not recognized by a dictionary
internal to the word processor) in a colour different from that of the
normal text. As a variant, Microsoft Word™ underlines misspelled words
in a colour different from that of the normal text. In these cases, it
is simple to ascertain the validity of a word by checking it against
dictionaries. Either a word is correctly spelled or it is not. However,
these aspects of known text processing systems deal only with possible
spelling errors. Additionally, because spell checkers in text
processing systems use only a binary, true/false criterion to determine
whether a word is correctly (or possibly incorrectly) spelled, these
systems will choose one of two colours in which to display the word. In
other words, there are no shades of gray. The word is merely displayed
in one colour if it is correctly spelled and in a second colour if the
system suspects the word is incorrectly spelled. Grammar checking
systems operate similarly, in that the system will choose one of two
colours in which to display the text depending upon whether the system
determines that correct grammar has been used.

[0003] By contrast, the inventive method and apparatus of the present
invention deal with speech recognition errors, and in particular with
levels of confidence that a speech recognition system has in
recognizing words that are spoken by a user. With the method and
apparatus of the present invention, an indication is produced, which is
correlated to a speech recognition engine's calculated probability that
it has correctly recognized a word. Whether or not a word has been
correctly recognized, the displayed word will always be correctly
spelled. Additionally, the inventive system supports multiple levels of
criteria in determining how to display a word by providing a multilevel
confidence display.

[0004] In another area, known data visualization systems use colour and
other visual attributes to communicate quantitative information. For
example, an electroencephalograph (EEG) system may display a colour
contour map of the brain, where colour is an indication of amplitude of
electrical activity. Additionally, meteorological systems display maps
where rainfall amounts or temperatures may be indicated by different
colours. Contour maps display altitudes and depths in corresponding
ranges of colours. However, such data visualization systems have not
been applied to text, or more specifically, to text created by a speech
recognition/dictation system.

[0005] In yet another area, several speech recognition dictation
systems have the capability of recognizing a spoken command. For
example, a person dictating text may dictate commands, such as
"Underline this section of text", or "Print this document". In these
cases, when the match between the incoming acoustic signal and the
decoded text has a low confidence score, the spoken command is flagged
as being unrecognized. In such a circumstance, the system will display
an indication over the user interface, e.g. a question mark or some
comment such as "Pardon Me?". However, obviously such systems merely
indicate whether a spoken command is recognized and are, therefore,
binary, rather than multilevel, in nature. In the example just given,
the system indicates that it is unable to carry out the user's command.
Thus, the user must take some action. Such systems fail to deal with
the issue of displaying text in a manner that reflects the system's
varying level of confidence in its ability to comply with a command.

[0006] In yet another area, J.R. Rhyne and G.C. Wolf's chapter entitled
"Recognition Based User Interfaces", published in Advances in
Human-Computer Interaction, 4:216-218, Ablex, 1993, R. Hartson and D.
Hix, editors, states "the interface may highlight the result just when
the resemblance between the recognition alternatives are close and the
probability of a substitution error is high." However, this is just
another instance of using binary criteria and is to be contrasted with
the multilevel confidence display of the present invention.
Furthermore, this reference merely deals with substitution error and
lacks user control, unlike the present invention which addresses not
only substitution errors but also deletion errors, insertion errors,
and additionally, provides for user control.

[0007] Traditionally, when users dictate text using speech recognition
technology, recognition errors are hard to detect. The user typically
has to read the entire dictated document carefully word by word,
looking for insertions, deletions and substitutions. For example, the
sentence "there are no signs of cancer" can become "there are signs of
cancer" through a deletion error. This type of error can be easy to
miss when quickly proofreading a document.
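
The three error types named above can be illustrated with a short
sketch. It is not part of the described system, which has no access to
the words actually spoken; it assumes a known reference sentence and
uses Python's standard difflib module to align it with the
transcription. The function name word_errors is hypothetical.

```python
from difflib import SequenceMatcher


def word_errors(spoken: str, transcribed: str):
    """Return (kind, spoken_words, transcribed_words) for each mismatch."""
    ref, hyp = spoken.split(), transcribed.split()
    # difflib opcode tags renamed to the terms used in the text
    kinds = {"delete": "deletion", "insert": "insertion", "replace": "substitution"}
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((kinds[tag], ref[i1:i2], hyp[j1:j2]))
    return errors


# The example from the text: the word "no" is lost.
errors = word_errors("there are no signs of cancer", "there are signs of cancer")
print(errors)  # [('deletion', ['no'], [])]
```

A single dropped word reverses the meaning of the sentence, yet the
transcription still reads as fluent text, which is why such errors are
easy to overlook.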

[0008] It would be desirable to provide a system that 
displays transcribed text in accordance with the sys- 
tem's level of confidence that the transcription is accu- 
rate. It also would be desirable if such a system could 
display more than a binary indication of its level of con- 
fidence. 

DISCLOSURE OF THE INVENTION

[0009] The present invention relates to a speech recognition computer
system and method that indicates the level of confidence that a speech
recognizer has in one or more displayed words. The level of confidence
is indicated using an indicator, such as colour, associated with the
word or words that are displayed on a user interface. The system has a
voice input device, such as a microphone, that inputs acoustic signals
to the speech recognizer. The speech recognizer translates the acoustic
signal from the voice input device into text, e.g. one or more words. A
confidence level process in the speech recognizer produces a score
(confidence level) for each word that is recognized. A confidence level
indicator process then produces one of one or more indications,
associated with each of the one or more words displayed on the user
interface. The indication is related to one of one or more sub-ranges,
in which the score falls. The words are displayed on a user interface
as text with the properties of the text (e.g. colour) reflecting the
confidence score.
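
The relationship between a score and the sub-range in which it falls
can be sketched as follows. This is a minimal illustration only: the
0.0 to 1.0 score scale, the boundary values, the colour names and the
function name indication_for are all assumptions, since the text does
not specify them.

```python
from bisect import bisect_right

# Assumed sub-range boundaries on an assumed 0.0-1.0 score scale.
# Two boundaries give three sub-ranges: [0, 0.5), [0.5, 0.8), [0.8, 1].
BOUNDARIES = [0.5, 0.8]

# One indication per sub-range, lowest confidence first (assumed colours).
INDICATIONS = ["red", "orange", "black"]


def indication_for(score: float) -> str:
    """Return the indication of the sub-range in which the score falls."""
    return INDICATIONS[bisect_right(BOUNDARIES, score)]


for word, score in [("there", 0.95), ("are", 0.65), ("signs", 0.30)]:
    print(word, indication_for(score))
```

Adding a boundary adds a sub-range, so the same lookup serves any
number of confidence levels.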

BRIEF DESCRIPTION OF THE DRAWINGS 

[0010] The invention will now be described, by way of example only,
with reference to the accompanying drawings, in which:

Figure 1 is a block diagram of a preferred embodiment of the present
invention;

Figure 2 is a flow chart which shows the steps carried out in the
system depicted in Figure 1; and

Figure 3 is a flow chart which provides greater detail of the
confidence level indicator process.

DETAILED DESCRIPTION OF THE INVENTION 

[0011] Figure 1 shows a system and method for displaying words with
attributes that are correlated to confidence levels. A human speaker
talks into a microphone (170). The microphone transmits an acoustic
(speech) signal to a speech engine process (160). The speech engine
process may be either software or a combination of software and
hardware, which digitizes the incoming acoustic signal and performs a
recognition function (190). The recognition function (190) translates
the acoustic signal into text, i.e. one or more words. This recognition
and translation may be accomplished in a number of different ways which
are well known to those in the field. Each word is assigned a
confidence level score by a confidence level scorer (200). This
confidence level score is assigned using an algorithm to determine the
level of accuracy with which the recognizer (190) determines it has
translated the acoustic (speech) signal to text. Each word and its
assigned confidence level score form a word/score (210) pair, each of
which is sent to a graphical user interface (GUI) application (150).
The GUI application (150) may receive information from a user control
(140) to enable a user of the system to select score thresholds, above
which (or below which) default attributes are used in displaying the
words. The user may also provide information, via the user control
(140), to control which colour maps and/or attribute maps are used to
display the words. The use of the thresholds and maps will be discussed
in more detail below.

[0012] Having received the word/score pairs, GUI application (150) uses
a Confidence Level Indicator Process (CLIP) (180) along with
information from the user control (140), if any, to assign a colour
and/or an attribute to each word (110, 120, 130). The CLIP is a mapping
algorithm which takes the score which was assigned by the confidence
level scorer (200) and determines what colour and/or attribute should
be associated with that score. The resulting colour and/or attribute
used to display the word then reflects the level of accuracy with which
the recognizer determines it has translated the acoustic (speech)
signal into text.

[0013] The colour selected might be from a map of a range of different
colours or might be from a map of different shades of a single colour.
Additionally, the attribute selected may include features such as font
type, point size, bold, italics, underline, double underline,
capitalization, flashing, blinking, or a combination of any of these
features. Once a word and its associated colour and/or attribute are
determined for each word, the pairs are then displayed on an output
device (105), with each word being displayed with its associated colour
and/or attribute (110, 120, 130).
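
A map of different shades of a single colour might look like the
sketch below. It is illustrative only: the 0.0 to 1.0 score scale, the
choice of red, the linear interpolation and the HTML output are
assumptions, and the names shade_of_red and render are hypothetical.

```python
from html import escape


def shade_of_red(score: float) -> str:
    """Map a score in [0, 1] to a hex colour: 0 gives pure red, 1 gives black."""
    score = min(max(score, 0.0), 1.0)  # clamp out-of-range scores
    red = round(255 * (1.0 - score))
    return f"#{red:02x}0000"


def render(pairs) -> str:
    """Render (word, score) pairs as HTML, one coloured span per word."""
    return " ".join(
        f'<span style="color:{shade_of_red(score)}">{escape(word)}</span>'
        for word, score in pairs
    )


print(render([("there", 1.0), ("are", 0.8), ("signs", 0.2)]))
```

With this map, words the recognizer is sure of blend into normal black
text, while less certain words appear progressively redder.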

[0014] Figure 2 shows, in a flow chart form, the steps which are
carried out in the embodiment described in connection with Figure 1.
Figure 2 shows that the acoustic (speech) signal generated by a speaker
speaking into a microphone is sent to the speech engine process (160)
containing a recognizer (190) for decoding the acoustic signal to text
or words as well as a confidence level scorer (200) for assigning a
score to the words. This score reflects the level of confidence the
speech recognition system has in its translation of the processed
acoustic signals. Each word, with its associated score, is then sent
from the confidence level scorer (200) in the speech engine process
(160) to graphical user application (150). The graphical user
application (150) may accept information from the user control (140) to
control the threshold and colour and/or attribute mapping and use that
information in the CLIP (180) within the graphical user application
(150). The CLIP (180) then assigns a colour and/or attribute to each
word based upon the score given to each word and based upon the
information from the user, if any. Thus, the graphical user interface
application (150) has as its output each word with an associated colour
and/or attribute. This information is then used to display the word
with the associated colour and/or attribute, which, in turn, is an
indication of the confidence level associated with each word.

[0015] Figure 3 depicts a flow chart showing more detail of CLIP (180
in Figures 1 and 2). A word/score pair (210) is received by the CLIP
(180), which assigns a default colour and font attribute to the word
(181). The word and its score are reviewed (182). If the score is above
the threshold, the word is displayed with the default colour and
attribute (220). If the score is below the threshold (141), which may
be defined by a user or defined by the system, the word and its
associated score go to a process that checks for colour mapping (183).
When a colour map (240) is used, the appropriate colour (determined by
the word's score) is mapped to the word (185). Irrespective of whether
colour mapping is used, the process checks whether the attribute
mapping of the word needs to be changed based on the score (184). If
so, the attribute mapping process (184) maps the correct font attribute
based on the score (186) using an attribute map (230). The word, with
colour and attribute if appropriate, is then displayed (220).
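
The flow of Figure 3 can be sketched in a few lines. This is a
hypothetical rendering, not the patented implementation: the names
Styled and clip, the default values "black" and "normal", the
representation of the maps as functions of the score, and the
treatment of a score equal to the threshold as "above" are all
assumptions. The step numbers in the comments are those of the text.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ScoreMap = Callable[[float], str]  # assumed form of a colour or attribute map


@dataclass
class Styled:
    word: str
    colour: str
    attribute: str


def clip(word: str, score: float, threshold: float,
         colour_map: Optional[ScoreMap] = None,
         attribute_map: Optional[ScoreMap] = None) -> Styled:
    # (181) assign the default colour and font attribute (assumed defaults)
    styled = Styled(word, colour="black", attribute="normal")
    # (182) review the score; at or above the threshold the defaults stand
    if score >= threshold:
        return styled
    # (183, 185) if a colour map is in use, map the score to a colour
    if colour_map is not None:
        styled.colour = colour_map(score)
    # (184, 186) independently, if an attribute map is in use, map the attribute
    if attribute_map is not None:
        styled.attribute = attribute_map(score)
    return styled  # (220) ready for display


result = clip("signs", 0.3, threshold=0.7,
              colour_map=lambda s: "red" if s < 0.5 else "orange",
              attribute_map=lambda s: "bold")
print(result)
```

Passing only one of the two maps gives colour mapping or attribute
mapping alone, and a user-selected threshold is simply a different
value for the threshold argument.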

[0016] Variants to the invention are possible. For example, in the flow
chart of Figure 3, colour and/or attribute mapping may be carried out
if the word/score pair is above, rather than below a threshold. Also,
colour mapping or attribute mapping may be carried out alone, rather
than serially. That is, either colour mapping or attribute mapping may
be used alone.



Claims 

1. A speech recognition system comprising:

a speech recognizer for translating speech into text, said text being
one or more words, said speech recognizer further comprising a
confidence level scorer (200) for assigning one of at least three
possible scores for each of said one or more words, said score being a
confidence measure that said one or more words has been recognized
correctly; and
a user interface (150) for displaying said one or more words, each of
said one or more words having display properties based on said scores.

2. A speech recognition system as claimed in claim 1, wherein said
different display properties include a default display property and two
or more other display properties.

3. A speech recognition system as claimed in claim 2, 
wherein said default display property is normal text. 

4. A speech recognition system as claimed in claim 2, wherein said one
or more words is displayed with one of said two or more other display
properties when said confidence measure is below a threshold, thereby
indicating a possible error.

5. A speech recognition system as claimed in claim 4, wherein said
threshold level is selected by a user of said speech recognition
system.

6. A speech recognition system as claimed in claim 2, wherein said one
or more words are displayed with said default display property when
said confidence measure is above a threshold level.

7. A speech recognition system as claimed in claim 1, wherein each of
said different display properties is a different colour.

8. A speech recognition system as claimed in claim 1, wherein each of
said different display properties is at least one different font
attribute selected from the group consisting of font type, point size,
bold, italics, underline, double underline, capitalization, flashing
and blinking.

9. A speech recognition system as claimed in claim 1, wherein each of
different display properties is one of a different shade of a single
colour or a different shade of gray.

10. A speech recognition system as claimed in claim 5, wherein said
threshold selection enables said user to select one of a colour map or
a gray scale map to identify which one of said at least three possible
scores is assigned to each of said one or more words.

11. A method of speech recognition comprising:

translating input speech into text, said text being one or more words;

assigning one of at least three possible confidence level scores for
each of said one or more words, said score being a confidence measure
that said one or more words has been recognized correctly; and

displaying said one or more words based on said assigning step, each of
said one or more words having display properties based on said scores.

12. A method of speech recognition as claimed in claim 11, wherein said
one or more words is displayed with one of said two or more other
display properties when said confidence measure of said one or more
words is below a threshold level.

13. A method of speech recognition as claimed in claim 12, further
comprising the step of:

providing user selectability of said threshold